{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Clustering Code Along\n",
    "\n",
    "We'll be working with a real data set about seeds, from UCI repository: https://archive.ics.uci.edu/ml/datasets/seeds."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for \n",
    "the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin. \n",
    "\n",
    "The data set can be used for the tasks of classification and cluster analysis.\n",
    "\n",
    "\n",
    "Attribute Information:\n",
    "\n",
    "To construct the data, seven geometric parameters of wheat kernels were measured: \n",
    "1. area A, \n",
    "2. perimeter P, \n",
    "3. compactness C = 4*pi*A/P^2, \n",
    "4. length of kernel, \n",
    "5. width of kernel, \n",
    "6. asymmetry coefficient \n",
    "7. length of kernel groove. \n",
    "All of these parameters were real-valued continuous.\n",
    "\n",
    "Let's see if we can cluster them in to 3 groups with K-means!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "spark = SparkSession.builder.appName('cluster').getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.clustering import KMeans\n",
    "\n",
    "# Loads data.\n",
    "dataset = spark.read.csv(\"seeds_dataset.csv\",header=True,inferSchema=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22)"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+\n",
      "|summary|              area|         perimeter|         compactness|   length_of_kernel|   width_of_kernel|asymmetry_coefficient|   length_of_groove|\n",
      "+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+\n",
      "|  count|               210|               210|                 210|                210|               210|                  210|                210|\n",
      "|   mean|14.847523809523816|14.559285714285718|  0.8709985714285714|  5.628533333333335| 3.258604761904762|   3.7001999999999997|  5.408071428571429|\n",
      "| stddev|2.9096994306873647|1.3059587265640225|0.023629416583846364|0.44306347772644983|0.3777144449065867|   1.5035589702547392|0.49148049910240543|\n",
      "|    min|             10.59|             12.41|              0.8081|              4.899|              2.63|                0.765|              4.519|\n",
      "|    max|             21.18|             17.25|              0.9183|              6.675|             4.033|                8.456|               6.55|\n",
      "+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "dataset.describe().show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Format the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.linalg import Vectors\n",
    "from pyspark.ml.feature import VectorAssembler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['area',\n",
       " 'perimeter',\n",
       " 'compactness',\n",
       " 'length_of_kernel',\n",
       " 'width_of_kernel',\n",
       " 'asymmetry_coefficient',\n",
       " 'length_of_groove']"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "vec_assembler = VectorAssembler(inputCols = dataset.columns, outputCol='features')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "final_data = vec_assembler.transform(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scale the Data\n",
    "It is a good idea to scale our data to deal with the curse of dimensionality: https://en.wikipedia.org/wiki/Curse_of_dimensionality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import StandardScaler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "scaler = StandardScaler(inputCol=\"features\", outputCol=\"scaledFeatures\", withStd=True, withMean=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Compute summary statistics by fitting the StandardScaler\n",
    "scalerModel = scaler.fit(final_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Normalize each feature to have unit standard deviation.\n",
    "final_data = scalerModel.transform(final_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the Model and Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Trains a k-means model.\n",
    "kmeans = KMeans(featuresCol='scaledFeatures',k=3)\n",
    "model = kmeans.fit(final_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Within Set Sum of Squared Errors = 429.07559671506715\n"
     ]
    }
   ],
   "source": [
    "# Evaluate clustering by computing Within Set Sum of Squared Errors.\n",
    "wssse = model.computeCost(final_data)\n",
    "print(\"Within Set Sum of Squared Errors = \" + str(wssse))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cluster Centers: \n",
      "[  6.31670546  12.37109759  37.39491396  13.91155062   9.748067\n",
      "   2.39849968  12.2661748 ]\n",
      "[  4.87257659  10.88120146  37.27692543  12.3410157    8.55443412\n",
      "   1.81649011  10.32998598]\n",
      "[  4.06105916  10.13979506  35.80536984  11.82133095   7.50395937\n",
      "   3.27184732  10.42126018]\n"
     ]
    }
   ],
   "source": [
    "# Shows the result.\n",
    "centers = model.clusterCenters()\n",
    "print(\"Cluster Centers: \")\n",
    "for center in centers:\n",
    "    print(center)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----------+\n",
      "|prediction|\n",
      "+----------+\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         0|\n",
      "|         0|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         1|\n",
      "|         2|\n",
      "+----------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "model.transform(final_data).select('prediction').show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you are ready for your consulting Project!\n",
    "# Great Job!"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
